[SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible #35528
Conversation
Gentle ping @cloud-fan @viirya Could you take a look? It's a useful feature.
Also ping @dongjoon-hyun @HyukjinKwon |
```scala
private def convertProvider(storage: CatalogStorageFormat): String = {
  val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
  Some("parquet").filter(serde.contains).getOrElse("orc")
}
```
nit: `if (serde.contains("parquet")) "parquet" else "orc"` is much simpler.
`if (serde.contains("parquet")) "parquet" else "orc"`

Updated.
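For context, a minimal self-contained sketch of the simplified helper after the nit was applied. The `StorageFormat` case class here is a stand-in for Spark's `CatalogStorageFormat`, introduced only so the sketch runs on its own; the SerDe class names are the real Hive Parquet/ORC SerDes.

```scala
import java.util.Locale

object ConvertProviderSketch {
  // Stand-in for Spark's CatalogStorageFormat, which carries the Hive SerDe
  // class name; used here for illustration only.
  case class StorageFormat(serde: Option[String])

  // Picks the data source provider from the SerDe class name. The conversion
  // rule only fires for Parquet/ORC SerDes, so anything that isn't Parquet
  // must be ORC.
  def convertProvider(storage: StorageFormat): String = {
    val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    if (serde.contains("parquet")) "parquet" else "orc"
  }

  def main(args: Array[String]): Unit = {
    println(convertProvider(StorageFormat(
      Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")))) // parquet
    println(convertProvider(StorageFormat(
      Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde")))) // orc
  }
}
```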
```scala
 * - When writing to partitioned Hive-serde Parquet/ORC tables when
 *   `spark.sql.hive.convertInsertingPartitionedTable` is true
 * - When writing to directory with Hive-serde
 * - When writing to non-partitioned Hive-serde Parquet/ORC tables using CTAS
```
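As an illustrative sketch of the CTAS case listed in the comment above (table name and data are made up; assumes the `spark.sql.hive.convertMetastore*` conversion configs are left at their defaults):

```scala
// Sketch: a non-partitioned Hive-serde Parquet CTAS. With this rule, the
// write side can be planned as a built-in Parquet data source write instead
// of going through the Hive SerDe.
spark.sql(
  """
    |CREATE TABLE demo_ctas STORED AS parquet AS
    |SELECT id FROM range(10)
  """.stripMargin)
```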
@cloud-fan Updated the comment of this rule, and also added a note about CTAS.
Gentle ping @cloud-fan, GA passed.
Thanks, merging to master!
@AngersZhuuuu Hi, in the case where the inserted directory has the same path as the selected table's location, this may cause an error. https://issues.apache.org/jira/browse/SPARK-38215
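A hedged sketch of the scenario being reported (path and table name are assumed, not taken from the report): the INSERT target directory is the same path as the source table's location, so an overwrite of the directory can clear the files the SELECT is still reading.

```scala
// Sketch of the reported hazard: tbl's LOCATION and the INSERT OVERWRITE
// DIRECTORY target are the same path, so overwriting the directory may wipe
// the data being selected.
spark.sql(
  """
    |CREATE TABLE tbl STORED AS parquet LOCATION '/tmp/same_path' AS
    |SELECT 1 AS id
  """.stripMargin)
spark.sql(
  """
    |INSERT OVERWRITE DIRECTORY '/tmp/same_path'
    |STORED AS parquet
    |SELECT * FROM tbl
  """.stripMargin)
```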
What changes were proposed in this pull request?
Currently, a Spark SQL `INSERT OVERWRITE DIRECTORY ... STORED AS` statement (InsertIntoHiveDirCommand) can't be converted to use InsertIntoDataSourceCommand and still uses the Hive SerDe to write data. Because of this, we can't use features provided by newer Parquet/ORC versions, such as zstd compression. See the sketch below.
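For illustration, a sketch of the kind of write this unlocks once the directory insert goes through the built-in Parquet writer (the output path and source table `src` are made up; `spark.sql.parquet.compression.codec` is the built-in Parquet writer's codec config):

```scala
// Sketch: with the insert converted to a data source write, the built-in
// Parquet writer's codec config applies, so zstd compression can be used.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.sql(
  """
    |INSERT OVERWRITE DIRECTORY '/tmp/zstd_out'
    |STORED AS parquet
    |SELECT * FROM src
  """.stripMargin)
```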
Why are the changes needed?
Converting InsertIntoHiveDirCommand to InsertIntoDataSourceCommand supports more features of Parquet/ORC.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests.